-------------------------------------------------------------------------------------------

PART 1

-------------------------------------------------------------------------------------------

Question:1

  1. Question: Refer to the table below to answer the following questions:

Planned to purchase Product A | Actually placed an order for Product A - Yes | Actually placed an order for Product A - No | Total
Yes                           | 400                                          | 100                                         | 500
No                            | 200                                          | 1300                                        | 1500
Total                         | 600                                          | 1400                                        | 2000

A. Refer to the above table and find the joint probability of the people who planned to purchase and actually placed an order.
B. Refer to the above table and find the conditional probability that people actually placed an order, given that they planned to purchase.

ANSWER :

A. From the table, the number of people who planned to purchase and actually placed an order = 400
   Total people = 2000

P(planned to purchase and actually placed an order) = (planned to purchase and actually placed an order) / total

                                                    = 400/2000

                                                    = 1/5

                                                    = 0.2

B. P(actually placed an order | planned to purchase) = (planned to purchase and actually placed an order) / (total planned to purchase)

                                                     = 400/500 = 4/5 = 0.8
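
The two results can be checked with a few lines of Python. This is only a sketch: the counts are the ones from the table, while the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts taken from the contingency table in the question.
planned_and_ordered = 400
planned_total = 500
grand_total = 2000

# Joint probability: P(planned AND ordered)
p_joint = Fraction(planned_and_ordered, grand_total)

# Conditional probability: P(ordered | planned) = P(planned AND ordered) / P(planned)
p_planned = Fraction(planned_total, grand_total)
p_conditional = p_joint / p_planned

print(float(p_joint))        # 0.2
print(float(p_conditional))  # 0.8
```

Using exact fractions avoids any rounding, so the results match the hand calculation exactly.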

Question:2

  1. An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.

     A. Probability that none of the items are defective?
     B. Probability that exactly one of the items is defective?
     C. Probability that two or fewer of the items are defective? 
     D. Probability that three or more of the items are defective ?

ANSWER:

N (sample size) = 10
Probability of failure/defective (5%) = 0.05
Probability of success/non-defective (95%) = 0.95

A. P(no defective item) = 10C0 * (P(Defective))^0 * (P(Non-Defective))^10
                        = 1 * (0.05)^0 * 0.95^10
                        = 1 * 1 * 0.5987369392383787
                        = 0.5987369392383787

B. P(exactly one of the items is defective) = 10C1 * (P(Defective))^1 * (P(Non-Defective))^9
                                            = 10 * 0.05 * 0.95^9
                                            = 0.31512470486230454

C. P(two or fewer of the items are defective) = P(no item defective) + P(1 item defective) + P(2 items defective)
                                              = 10C0 * (P(Defective))^0 * (P(Non-Defective))^10 + 10C1 * (P(Defective))^1 * (P(Non-Defective))^9 + 10C2 * (P(Defective))^2 * (P(Non-Defective))^8
                                              = 0.5987369392383787 + 0.31512470486230454 + 45 * (0.05)^2 * 0.95^8
                                              = 0.5987369392383787 + 0.31512470486230454 + 0.07463479852001952
                                              = 0.9884964426207028

D. P(three or more of the items are defective) = 1 - P(two or fewer of the items are defective)
                                               = 1 - 0.9884964426207028
                                               = 0.011503557379296881
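
The binomial calculations above can be reproduced with a small helper. This is a minimal sketch; the function name `binom_pmf` is introduced here for illustration and is not part of the original notebook.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k defective items among n, each defective with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.05  # sample size and historical failure rate from the question

p_none = binom_pmf(0, n, p)
p_one = binom_pmf(1, n, p)
p_two_or_fewer = sum(binom_pmf(k, n, p) for k in range(3))
p_three_or_more = 1 - p_two_or_fewer

for label, value in [("A", p_none), ("B", p_one), ("C", p_two_or_fewer), ("D", p_three_or_more)]:
    print(f"{label}: {value:.4f}")
```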

Question:3

  1. Question: A car salesman sells on an average 3 cars per week.

     A. Probability that in a given week he will sell some cars.
     B. Probability that in a given week he will sell 2 or more but less than 5 cars.
     C. Plot the Poisson distribution function for cumulative probability of cars sold per week vs number of cars sold per week.
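
No worked answer is recorded for this question in the text. One possible approach, assuming weekly sales follow a Poisson distribution with mean 3 and reading "some cars" as "at least one car", is sketched below. The helper name `poisson_pmf` and the cut-off of 10 cars for the cumulative table are choices made here for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # average cars sold per week, from the question

# A. At least one car sold: complement of selling none.
p_some = 1 - poisson_pmf(0, lam)

# B. Two or more but fewer than five cars: k = 2, 3, 4.
p_2_to_4 = sum(poisson_pmf(k, lam) for k in range(2, 5))

# C. Cumulative probabilities P(X <= k), ready to be passed to any plotting library.
cdf = []
running = 0.0
for k in range(11):
    running += poisson_pmf(k, lam)
    cdf.append((k, running))

print(f"A: {p_some:.4f}")    # about 0.9502
print(f"B: {p_2_to_4:.4f}")  # about 0.6161
for k, value in cdf:
    print(k, round(value, 4))
```

The `cdf` pairs give the x and y values for the cumulative plot asked for in part C.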

Question: 4

  1. Question: Accuracy in understanding orders is important for Company X, which has designed, marketed and launched a speech-based bot for contactless delivery at a restaurant due to the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.

     A. What is the probability that all three orders will be recognised correctly?
     B. What is the probability that none of the three orders will be recognised correctly?
     C. What is the probability that at least two of the three orders will be recognised correctly?

ANSWER

A. What is the probability that all three orders will be recognised correctly?

    P(all three orders will be recognised correctly) = 3C3 *(P(Correct))^3 * (1-P(Correct))^0
                                                     = 1 * (0.868)^3 * 1
                                                     = 0.653972032

B. What is the probability that none of the three orders will be recognised correctly?

    P(none of the three orders will be recognised correctly) = 3C0 *P(Correct)^0 * (1-P(Correct))^3
                                                             = 1* 1 * (1- 0.868)^3
                                                             = 0.0022999680000000003

C. What is the probability that at least two of the three orders will be recognised correctly?

    P(at least two of the three orders will be recognised correctly) 
                                     = P(2 orders recognised correctly) + P(3 orders recognised correctly)
                                     = 3C2 *P(Correct)^2 * (1-P(Correct))^1 + 3C3 *P(Correct)^3 * (1-P(Correct))^0
                                     = 3 * 0.868 ^ 2 * (1-0.868 )^1 + 1 * 0.868 ^ 3 * (1-0.868 )^0
                                     = 3 * 0.868 ^ 2 * (0.132)^1 + 1 * 0.868 ^ 3 * 1
                                     = 0.952327936
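
As an independent check, the same three probabilities can be obtained by enumerating all eight outcomes of three independent orders instead of using the binomial formula. This is a sketch; the variable names are chosen here for illustration.

```python
from itertools import product

p_correct = 0.868  # recognition accuracy from the question

# Total probability of each possible number of correctly recognised orders (0..3).
prob_by_count = {k: 0.0 for k in range(4)}
for outcome in product([True, False], repeat=3):
    prob = 1.0
    for recognised in outcome:
        prob *= p_correct if recognised else (1 - p_correct)
    prob_by_count[sum(outcome)] += prob

p_all = prob_by_count[3]
p_none = prob_by_count[0]
p_at_least_two = prob_by_count[2] + prob_by_count[3]

print(round(p_all, 6), round(p_none, 6), round(p_at_least_two, 6))
```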

Question 5

  1. Question: A group of 300 professionals sat for a competitive exam. The results show the information of marks obtained by them have a mean of 60 and a standard deviation of 12. The pattern of marks follows a normal distribution. Answer the following questions.
     A. What is the percentage of students who score more than 80?
     B. What is the percentage of students who score less than 50?
     C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
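
No worked answer is recorded for this question in the text. A possible sketch, assuming the marks are exactly normal with mean 60 and standard deviation 12 as stated, uses `NormalDist` from the Python standard library. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

marks = NormalDist(mu=60, sigma=12)  # parameters from the question

# A. Percentage scoring more than 80: upper tail beyond 80.
pct_above_80 = (1 - marks.cdf(80)) * 100

# B. Percentage scoring less than 50: lower tail below 50.
pct_below_50 = marks.cdf(50) * 100

# C. Distinction mark: the 90th percentile, so that 10% of students score above it.
distinction_mark = marks.inv_cdf(0.90)

print(f"A: {pct_above_80:.2f}%")      # about 4.78%
print(f"B: {pct_below_50:.2f}%")      # about 20.23%
print(f"C: {distinction_mark:.2f}")   # about 75.38
```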

Question 6

  1. Question: Explain 1 real life industry scenario [other than the ones mentioned above] where you can use the concepts learnt in this module of Applied statistics to get a data driven business solution.

ANSWER :

  1. Detecting tumors in brain scans. Using brain tumor scan data from past patients, we can analyse the data to predict the location and shape of tumors. Using the analysis we can determine the relationship between details such as life expectancy, recommended type of treatment, size of tumor, location of tumor and level of tumor. This will help us construct hypotheses and predict the probable solution for health care providers to focus on, and help in preliminary diagnosis.

  2. Companies use statistics in market research and new product development. We can take random surveys of consumers to gauge the market acceptance and potential of a proposed product. We can then tell executives whether there will be enough demand for the product. Is there enough demand to justify spending money to develop the product and, ultimately, to build a plant to produce it? From the statistical analysis, a break-even model is constructed to determine the volume of sales necessary for the product to succeed.

-------------------------------------------------------------------------------------------

PART 2

-------------------------------------------------------------------------------------------


• DOMAIN: Sports
• CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored and conceded, how many times they finished within the first 2 positions, how many tournaments they have qualified for, their best position in the past, etc.
• DATA DESCRIPTION: Basketball.csv - The data set contains information on all the teams that have so far participated in all the past tournaments.
• ATTRIBUTE INFORMATION:
    1. Team: Team’s name
    2. Tournament: Number of played tournaments.
    3. Score: Team’s score so far.
    4. PlayedGames: Games played by the team so far.
    5. WonGames: Games won by the team so far.
    6. DrawnGames: Games drawn by the team so far.
    7. LostGames: Games lost by the team so far.
    8. BasketScored: Basket scored by the team so far.
    9. BasketGiven: Basket scored against the team so far.
    10. TournamentChampion: How many times the team was a champion of the tournaments so far.
    11. Runner-up: How many times the team was runner-up of the tournaments so far.
    12. TeamLaunch: Year the team was launched in professional basketball.
    13. HighestPositionHeld: Highest position held by the team amongst all the tournaments played.
• PROJECT OBJECTIVE: Company’s management wants to invest in proposals for managing some of the best teams in the league. The analytics department has been assigned the task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach for a deal that will be a win for them.

EDA for PART TWO :

2.1 Load libraries :

2.2. Import the dataset(DS - Part2 - Basketball.csv):

Here we can see that there are 13 columns.
Team is categorical and unique for each row.
All the remaining columns are numerical.
TeamLaunch is temporal data containing a year or a duration.

2.3 Check the Dimension of data?

The dataset has 61 rows and 13 columns

2.4 Check the Information about the data and the datatypes of each respective attributes.

As we can see, most of the columns have the object data type even though their actual values are numerical. Hence we have to convert these columns to numerical types so that we can process them further.
**This will be done at a later point**

2.5 Data preprocessing

No Null Data Present

No duplicate data found


Most entries in TournamentChampion (48 entries) and Runner-up (52 entries) are '-'. We replace '-' with 0, meaning the team has never been champion or runner-up. This is the most suitable option, since replacing with the average or removing the rows would not be appropriate.

If we removed the rows containing '-', only 9 or fewer rows would remain, which is too small a sample to analyse.

Hence we replace the placeholder '-' with the value 0 (assuming a '-' means the team has no such finish to record). This also makes the data type uniform.

Since the TeamLaunch column is not uniform and contains both years and durations, we create a new column that captures the starting year by parsing the values in the column.

Convert the object columns to numeric so that numerical functions can be applied.
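
The cleaning steps described above (replace '-' with 0, extract the start year, convert to numbers) can be sketched in plain Python as follows. The sample rows, the TeamLaunch formats and the helper names `to_number` and `launch_year` are hypothetical and only illustrate the idea; they are not taken from Basketball.csv or from the original notebook.

```python
import re

def to_number(value):
    """Treat the '-' placeholder as 0 and convert everything else to int."""
    text = str(value).strip()
    return 0 if text == "-" else int(text)

def launch_year(value):
    """Return the first four-digit year found in a TeamLaunch value, or None."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None

# Hypothetical rows for illustration only.
rows = [
    {"Team": "Team A", "TournamentChampion": "33", "Runner-up": "23", "TeamLaunch": "1929"},
    {"Team": "Team B", "TournamentChampion": "-", "Runner-up": "1", "TeamLaunch": "1931to32"},
    {"Team": "Team C", "TournamentChampion": "-", "Runner-up": "-", "TeamLaunch": "1941-42"},
]

for row in rows:
    for column in ("TournamentChampion", "Runner-up"):
        row[column] = to_number(row[column])
    row["TeamLaunchYear"] = launch_year(row["TeamLaunch"])

for row in rows:
    print(row)
```

In pandas the same idea is usually expressed with `Series.replace` and `pd.to_numeric`.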

Insert the columns below for better analysis:

1. Percentage_won = WonGames /PlayedGames *100
2. Percentage_BasketScored = BasketScored/ (BasketScored + BasketGiven) *100
3. Finalist = TournamentChampion + Runner-up
4. percentage_Finalist = (TournamentChampion + Runner-up ) / Tournament *100
5. percentage_TournamentChampion = (TournamentChampion) / Tournament
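
The five derived columns listed above can be sketched as a small function. The example team and the function name `add_derived_columns` are hypothetical; the formulas follow the list exactly as written, so the last one is a fraction without the factor of 100. The sketch assumes PlayedGames, Tournament and total baskets are all greater than zero.

```python
def add_derived_columns(team):
    """Return a copy of the team record with the five derived columns added."""
    finalist = team["TournamentChampion"] + team["Runner-up"]
    total_baskets = team["BasketScored"] + team["BasketGiven"]
    return {
        **team,
        "Percentage_won": team["WonGames"] / team["PlayedGames"] * 100,
        "Percentage_BasketScored": team["BasketScored"] / total_baskets * 100,
        "Finalist": finalist,
        "percentage_Finalist": finalist / team["Tournament"] * 100,
        # As written in the list above: no multiplication by 100.
        "percentage_TournamentChampion": team["TournamentChampion"] / team["Tournament"],
    }

# Hypothetical team for illustration only.
example = {
    "Team": "Team A", "Tournament": 10, "PlayedGames": 200, "WonGames": 120,
    "BasketScored": 600, "BasketGiven": 400, "TournamentChampion": 2, "Runner-up": 3,
}
enriched = add_derived_columns(example)
for name, value in enriched.items():
    print(name, value)
```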
We analysed the basic structure of the data.
We have now completed the data filtering and processing.
We checked for duplicate and null data (none were found).
Replaced the '-' placeholders with 0.
Converted the TeamLaunch duration to a start year.
Changed the data types to numeric.
Introduced new columns for better analysis.
We are not sure how to interpret the missing data '-', which could have been improved with actual numerical values. Columns such as TournamentChampion and Runner-up have many missing entries, which makes them difficult to analyse.

2.6 Descriptive Statistics

Mean

Median

Since Team is the only categorical column and every value in it is unique, the mode is not required.

Quartiles

It can be observed that there are outliers in this case.

IQR

Range

Variance

Standard Deviation
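
The measures listed above (mean, median, quartiles, IQR, range, variance and standard deviation) can be sketched with Python's `statistics` module. The scores below are hypothetical values chosen for illustration and are not taken from Basketball.csv.

```python
import statistics as st

# Hypothetical scores for illustration only.
scores = [10, 12, 15, 18, 22, 30, 45, 80, 120, 400]

mean = st.fmean(scores)
median = st.median(scores)
q1, q2, q3 = st.quantiles(scores, n=4)    # quartile cut points
iqr = q3 - q1                             # interquartile range
value_range = max(scores) - min(scores)
variance = st.variance(scores)            # sample variance (n - 1 denominator)
std_dev = st.stdev(scores)                # sample standard deviation

print("mean:", mean)
print("median:", median)
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("range:", value_range)
print("variance:", variance)
print("standard deviation:", std_dev)
```

A mean far above the median, as in this made-up sample, is a typical sign of right skew and high-value outliers.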

Check Covariance and Correlation

It can be observed that most of the data are right skewed. Percentage scored and percentage won are approximately normal. The data are correlated and linear in nature, with some outliers.

It can be seen that score is highly correlated to most of the variables shown in correlation graph

Check the Skewness

As shown, most frequent values are low and tail is towards high values.
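
A positive skewness value corresponds to this shape (most values low, tail towards high values). The sketch below computes the population moment coefficient of skewness; this is an assumption about the measure, and library functions such as pandas' `skew` apply a sample-size correction, so their numbers differ slightly. The data are hypothetical.

```python
from statistics import fmean, pstdev

def skewness(values):
    """Population moment coefficient of skewness: mean of cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

# Hypothetical data for illustration only.
right_skewed = [10, 12, 15, 18, 22, 30, 45, 80, 120, 400]
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive: long tail towards high values
print(skewness(symmetric))     # zero for a symmetric sample
```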

Univariate Analysis

As shown in the histogram, the teams are spread across a range of scores. There are teams with higher scores, above 1K, 2K, 3K and 4K. Since score is a fairly good indicator of performance (as shown in the correlation), we can focus on teams with higher scores.

As shown in the histogram, some teams have a higher percentage of games won. The majority of teams are at around a 30% win percentage. We can focus on teams with a higher win percentage, as it can be a really good indicator of performance.

Percentage_BasketScored is the ratio of baskets scored to total baskets (scored + given). The majority of teams are at around 35-50%. We can focus on teams with a higher percentage of baskets scored.

Percentage finalist, finalist and tournament championship are good indicators of team performance. But since most teams have never been finalists, these indicators filter out most teams. Although being a finalist in a tournament is a good indicator, in the given data only a few old teams reach the finals.

Here we should focus on the teams with the highest rank held. We see a high frequency of teams with good performances. The company can invest in these teams, as they have the talent and capability to reach the top ranking.

With these graphical representations we get a fair idea of the performance of the different teams, and we can make decisions on investing in teams within each category.

Multivariate analysis

We can focus on teams with higher scores, such as Team1, Team2 and so on, as they have fairly high scores.

  1. It can be seen that score is highly correlated with most of the variables shown in the correlation graph.
  2. As shown in the heat map, all variables are directly related to each other, except Highest Position held and Team launch start year, which are inversely related to the rest, as expected.
  3. But the score is also correlated with losses and other negative items.

There are also a few spikes, which indicate the teams with a good percentage won. We can focus on teams with a higher percentage won, such as Team1, Team2 and so on, as they have fairly high scores.

Percentage tournament champions shows only a few teams with a high percentage. This shows that a few teams dominate the tournaments.

There are also a few spikes, which indicate the teams with a good percentage of baskets scored. We can focus on teams with a higher percentage of baskets scored, such as Team1, Team2 and so on, as they have fairly high scores.

The graph shows that the earlier a team was launched, the higher its score. We can focus on the teams with earlier launch years.

We can focus on teams that were launched in different years but have higher scores.

The finalists are basically restricted to a few teams.

We can focus on teams with a good percentage finalist, such as Team1, Team2 and so on.

We can focus on teams with more championships, such as Team1, Team2 and so on.

Here we can focus on teams such as 1, 2, 3, 4 and so on, with a fairly good HighestPositionHeld.

Data analysis: 

    1. There is a lot of missing data and the dataset is small. Once we remove the missing data, only 8-10 rows remain.
    2. Most columns are highly correlated with each other, which means less variability or richness in the data.
    3. As in the report, most data are uniform.
    4. There is high cardinality in the data, which makes it difficult to categorise or group.
    5. The data is highly skewed.

Summary :


Hence we can conclude that the company can approach teams such as Team 1, 2, 3, 4 and so on, since they have a good record over the years and the highest percentages and counts of scores, wins, finalist appearances, baskets and tournament championships.




-------------------------------------------------------------------------------------------

PART 3

-------------------------------------------------------------------------------------------

• DOMAIN: Startup ecosystem
• CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on business related to technology news, analysis of emerging trends and profiling of new tech businesses and products. Their event, Startup Battlefield, is the world’s pre-eminent startup competition. Startup Battlefield features 15-30 top early stage startups pitching to top judges in front of a vast live audience, present in person and online.
• DATA DESCRIPTION: CompanyX_EU.csv - Each row in the dataset is a start-up company and the columns describe the company.
• ATTRIBUTE INFORMATION:
    1. Startup: Name of the company
    2. Product: Actual product
    3. Funding: Funds raised by the company in USD
    4. Event: The event the company participated in
    5. Result: Described by Contestant, Finalist, Audience choice, Winner or Runner up
    6. OperatingState: Current status of the company: Operating, Closed, Acquired or IPO
    *Dataset has been downloaded from the internet. All the credit for the dataset goes to the original creator of the data.
• PROJECT OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.

1.1 Data warehouse: Read the CSV file.

From the table, it is observed that there are 6 columns with the characteristics below:

    1. Startup : Object type with a unique value for each row
    2. Product : Object type with a unique value for each row
    3. Funding : Funding is a continuous value; as we can see, it has to be processed because it is stored in a non-uniform format
    4. Event : Categorical data with a list of different event names
    5. Result : Categorical data
    6. OperatingState : Categorical data

2.1 Check the datatypes of each attribute.

All columns currently have the object data type

We need the column 'Funding' to be of numerical data type

Data has 662 rows and 6 columns


2.2 Check for null values in the attributes.

There are missing data in Product(6) and Funding(214)

3 Data preprocessing & visualisation

3.1 Drop the null values.

3.2 Convert the ‘Funding’ features to a numerical value.

Remove the $ sign from Funding

Here we first strip the K, M or B suffix, then extract the decimal number and multiply it by 10^3 for K, 10^6 for M and 10^9 for B. After this we convert the type to int
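
The conversion described above can be sketched as a single parsing function. The function name `parse_funding` and the example strings are hypothetical; the exact formats in CompanyX_EU.csv may differ.

```python
import re

MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}

def parse_funding(text):
    """Convert a string such as '$1.5M' into an integer number of USD."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB]?)", cleaned)
    if match is None:
        raise ValueError(f"Unrecognised funding value: {text!r}")
    number, suffix = match.groups()
    # round() guards against floating-point noise such as 19300000.000000004
    return int(round(float(number) * MULTIPLIERS.get(suffix, 1)))

# Hypothetical examples for illustration only.
for raw in ["$630K", "$1.5M", "$19.3M", "$2B"]:
    print(raw, "->", parse_funding(raw))
```

Raising an error on unexpected input, rather than silently returning 0, makes malformed rows visible during cleaning.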

3.3 Plot box plot for funds in million.

There are significant outliers in the data. Because the outliers squeeze the box until it is barely visible, we plot the box plot again showing only the box, for better visibility of the quartiles

3.4 Get the lower fence from the box plot.

Lower fence in box plot : 0.005

3.5 Check number of outliers greater than upper fence.

There are 60 outliers greater than the upper fence
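
The fences are commonly defined with the 1.5 × IQR rule. The sketch below assumes that rule and uses hypothetical funding values in millions; the helper name `tukey_fences` is introduced here for illustration.

```python
import statistics as st

def tukey_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical funds in millions, for illustration only.
funds = [0.01, 0.1, 0.5, 1.0, 2.0, 3.5, 5.0, 8.0, 12.0, 150.0]

lower, upper = tukey_fences(funds)
outliers = [x for x in funds if x > upper]
kept = [x for x in funds if x <= upper]

# A box plot's lower whisker ends at the smallest observation inside the fence.
lower_whisker = min(x for x in funds if x >= lower)

print("fences:", lower, upper)
print("outliers above upper fence:", outliers)
print("lower whisker:", lower_whisker)
```

When the computed lower fence is negative, as in this made-up sample, the whisker drawn on the plot stops at the smallest observed value instead.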

3.6 Drop the values that are greater than upper fence.

3.7 Plot the box plot after dropping the values.

3.8 Check frequency of the OperatingState features classes.

275 companies are operating, 56 are closed and 55 are acquired

3.9 Plot a distribution plot for Funds in million.

Data is pretty heavily skewed

3.10 Plot distribution plots for companies still operating and companies that closed.

4 Statistical analysis:

4.1 Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?

1. Write the null hypothesis and alternative hypothesis.
2. Test for significance and conclusion


ANSWER: 
H0: μ1 = μ2, or μ2 - μ1 = 0, that is, there is no difference between the mean funds raised by the two groups
HA: μ2 < μ1, or μ2 - μ1 < 0

Let's consider a significance level of 5%:
α = 0.05

Since the p-value of the test is greater than α, we fail to reject the null hypothesis that there is no difference between the means.

So there is no evidence in the data that companies with more funding are more likely to succeed
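
The test used in the original notebook is not shown in the text. One way such a comparison could be run is a Welch-type statistic with a normal approximation to the one-sided p-value, which is reasonable only for large samples. The samples and the helper name `welch_statistic` below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance

def welch_statistic(sample_a, sample_b):
    """Difference in means divided by its standard error (unequal variances allowed)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (fmean(sample_a) - fmean(sample_b)) / se

# Hypothetical funds in millions, for illustration only.
operating = [0.5, 1.2, 2.0, 3.1, 0.8, 4.5, 1.7, 2.6]
closed = [0.4, 1.0, 2.2, 2.9, 0.7, 3.8, 1.5, 2.4]

alpha = 0.05
statistic = welch_statistic(closed, operating)

# One-sided alternative: mean(closed) < mean(operating).
# Normal approximation to the p-value; an assumption that suits large samples.
p_value = NormalDist().cdf(statistic)

print("statistic:", round(statistic, 4))
print("p-value:", round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With small samples a t distribution should be used for the p-value instead of the normal approximation.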

4.2 Make a copy of the original data frame.

4.3 Check frequency distribution of Result variable.

4.4 Calculate percentage of winners that are still operating and percentage of contestants that are still operating

4.5 Write your hypothesis comparing the proportion of companies that are operating between winners and contestants:

Write the null hypothesis and alternative hypothesis.
Test for significance and conclusion

Null hypothesis (Ho): The proportion of companies that are operating is the same in both categories - winners and contestants

Alternative hypothesis (Ha): The proportion of companies that are operating differs between the two categories

Hence we can say that winners of the events tend to remain operational more often, as there is a significant difference between the proportions of winners and contestants that are still operating
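
A two-proportion z-test is one common way to compare such proportions; whether the original notebook used exactly this test is an assumption. The counts and the helper name `two_proportion_ztest` below are hypothetical and do not come from the dataset.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for equality of two proportions, using the pooled estimate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts for illustration only:
# operating winners out of all winners, operating contestants out of all contestants.
z, p_value = two_proportion_ztest(success_a=23, n_a=25, success_b=200, n_b=300)

alpha = 0.05
print("z:", round(z, 4), "p-value:", round(p_value, 4))
print("reject H0" if p_value < alpha else "fail to reject H0")
```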

4.6 Check distribution of the Event variable.

4.7 Select only the Event that has disrupt keyword from 2013 onwards.
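
This selection can be sketched as a filter on the event name. The event names below and the helper name `is_disrupt_from_2013` are hypothetical; the sketch assumes each event name contains a four-digit year.

```python
import re

def is_disrupt_from_2013(event):
    """True if the event name contains 'disrupt' (any case) and a year of 2013 or later."""
    if "disrupt" not in event.lower():
        return False
    year = re.search(r"\d{4}", event)
    return year is not None and int(year.group()) >= 2013

# Hypothetical event names for illustration only.
events = ["Disrupt NY 2013", "Disrupt SF 2011", "Disrupt EU 2014", "Hardware Battlefield 2014"]
selected = [event for event in events if is_disrupt_from_2013(event)]
print(selected)
```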

4.8 Write and perform your hypothesis along with significance test comparing the funds raised by companies across NY, SF and EU events from 2013 onwards.

Null Hypothesis (Ho): Average funds raised by companies across the three event locations are the same: μNY = μSF = μEU, that is, there is no difference between the means

Alternative Hypothesis (Ha): Average funds raised by companies differ for at least one of the three event locations, that is, μNY, μSF and μEU are not all equal

Since we fail to reject the null hypothesis, we can say that there is no evidence to claim that companies participating in certain regions raise funds that are significantly higher or lower
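
Comparing three group means is typically done with a one-way ANOVA; whether the original notebook used it is an assumption. The sketch below computes only the F statistic by hand, with hypothetical groups. The p-value needs the F distribution, for example through `scipy.stats.f_oneway`.

```python
from statistics import fmean

def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = fmean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical funds in millions for the three events, for illustration only.
ny = [1.0, 2.0, 3.0]
sf = [2.0, 3.0, 4.0]
eu = [3.0, 4.0, 5.0]

print("F statistic:", anova_f(ny, sf, eu))
```

A larger F statistic means the group means differ more relative to the spread within each group.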

4.9 Plot the distribution plot comparing the 3 city events.

Modes of the three distributions are similar
Dispersion in NY is quite high compared to the others
Overall, the distributions look quite similar to the eye

5 Write your observations on improvements or suggestions on quality, quantity, variety, velocity, veracity etc. on the data points collected to perform a better data analysis.

1. Funding has 214 (32.3%) missing values
2. The data is heavily skewed
3. Even after removing many outliers, the scale was still not good enough
4. We do not have absolute numbers to use directly in our tests
5. In total 220 values are missing (214 in Funding and 6 in Product); filling these would give more accurate observations
6. The distributions are not normal and have too many outliers
7. Event is highly correlated with OperatingState
8. We have a lot of outliers in Funding, which shows that the data is not uniform